Transitioning to Computer Vision: Why Use CNNs?
EvoClass-AI002, Lecture 4

Transitioning to Computer Vision

Today we move from processing simple structured data with basic linear layers to handling high-dimensional image data. A single color image introduces significant complexity that standard architectures cannot handle efficiently. Deep learning for computer vision calls for a specialized approach: the convolutional neural network (CNN).

1. Why Fully Connected Networks (FCNs) Fail

In a fully connected network, every input pixel must connect to every neuron in the next layer. For high-resolution images, this causes the computational cost to explode, making training infeasible and leading to poor generalization through severe overfitting.

  • Input Dimension: A standard $224 \times 224$ RGB image results in $150,528$ input features ($224 \times 224 \times 3$).
  • Hidden Layer Size: Suppose the first hidden layer uses 1,024 neurons.
  • Total Parameters (Layer 1): $\approx 154$ million weights ($150,528 \times 1024$) just for the first connection block, requiring massive memory and compute time.
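The arithmetic in the bullets above can be checked with a short sketch (the image size and hidden-layer width are the illustrative numbers from the lecture, not fixed requirements):

```python
# Sketch: parameter count for the first fully connected layer on a
# 224x224 RGB image, using the lecture's illustrative numbers.
height, width, channels = 224, 224, 3
hidden_units = 1024

input_features = height * width * channels      # 150,528 input features
fc_weights = input_features * hidden_units      # one weight per connection
fc_biases = hidden_units                        # one bias per neuron
fc_params = fc_weights + fc_biases

print(f"Input features:        {input_features:,}")
print(f"FC layer-1 parameters: {fc_params:,}")  # roughly 154 million
```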
The CNN Solution
CNNs solve the FCN scaling problem by exploiting the spatial structure of images. They use small filters to detect patterns (such as edges or curves), cutting the parameter count by several orders of magnitude and improving model robustness.
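To make the efficiency claim concrete, here is a sketch comparing the FC layer above against a convolutional layer. The configuration (64 filters of size 3×3 over a 3-channel input) is an assumed example, but the key point holds for any choice: a conv layer's parameter count depends only on the filter shape, not the image size.

```python
# Sketch: a conv layer's parameter count is independent of image resolution.
# Assumed example configuration: 64 filters of size 3x3 over 3 input channels.
kernel_h, kernel_w, in_channels, num_filters = 3, 3, 3, 64

conv_weights = kernel_h * kernel_w * in_channels * num_filters  # 1,728
conv_params = conv_weights + num_filters                        # + one bias per filter

fc_params = 224 * 224 * 3 * 1024  # first FC layer from the earlier example
print(f"Conv parameters:  {conv_params:,}")
print(f"FC parameters:    {fc_params:,}")
print(f"Reduction factor: {fc_params // conv_params:,}x")
```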
Question 1
What is the primary benefit of using Local Receptive Fields in CNNs?
Filters only focus on a small, localized region of the input image.
It allows the network to process the entire image globally at once.
It ensures all parameters are initialized to zero.
It removes the need for activation functions.
Question 2
If a $3 \times 3$ filter is applied across an entire image, what core CNN concept is being utilized?
Kernel Normalization
Shared Weights
Full Connectivity
Feature Transposition
Question 3
Which CNN component is responsible for progressively reducing the spatial dimensions (width and height) of the feature maps?
ReLU Activation
Pooling Layers (Subsampling)
Batch Normalization
Challenge: Identifying Key CNN Components
Relate CNN mechanisms to their functional benefits.
We need to build a vision model that is highly parameter efficient and can recognize an object even if it slightly shifts its position in the image.
Step 1
Which mechanism ensures the network can identify a feature (like a diagonal line) regardless of where it is in the frame?
Solution:
Shared Weights. By applying the same filter at every location, the network detects a given feature wherever it appears in the frame (translation equivariance; pooling then adds a degree of translation invariance).
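A toy demonstration of shared weights, using a hypothetical hand-set filter rather than learned values: the same 3×3 diagonal detector slides over a 5×5 image and fires wherever the diagonal pattern occurs.

```python
# Sketch: the SAME 3x3 filter (shared weights) is applied at every position,
# so a diagonal-line detector responds wherever that pattern appears.
# Toy 5x5 image with a main diagonal of 1s; filter values are hand-set.
image = [[1 if r == c else 0 for c in range(5)] for r in range(5)]
diag_filter = [[1, 0, 0],
               [0, 1, 0],
               [0, 0, 1]]  # responds strongly to main-diagonal patterns

def conv2d_valid(img, kernel):
    """Plain 'valid' convolution (cross-correlation, as CNNs implement it)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(img[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

response = conv2d_valid(image, diag_filter)
for row in response:
    print(row)  # strong responses track the diagonal's position
```

Note that the response map peaks wherever the diagonal sits, even though only nine weights are used for the whole image.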
Step 2
What architectural choice allows a CNN to detect features with fewer parameters than an FCN?
Solution:
Local Receptive Fields (or Sparse Connectivity). Instead of connecting to every pixel, each neuron only connects to a small, localized region of the input.
Step 3
How does the CNN structure lead to hierarchical feature learning (e.g., edges $\to$ corners $\to$ objects)?
Solution:
Stacked Layers. Early layers learn simple features (edges) using convolution. Deeper layers combine the outputs of earlier layers to form complex, abstract features (objects).
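One way to see how stacking enables this hierarchy is to track the receptive field: each output unit in a deeper layer "sees" a wider patch of the input, so it can combine several edge detections into a larger structure. The sketch below assumes stride-1 3×3 convolutions throughout.

```python
# Sketch: receptive-field growth for stacked stride-1 3x3 conv layers.
# Each added layer widens the input patch one output unit can see by
# (kernel - 1) pixels, letting deeper layers compose simpler features.
def receptive_field(num_layers, kernel=3):
    """Receptive field of one output unit after stacking stride-1 conv layers."""
    return 1 + num_layers * (kernel - 1)

for layers in (1, 2, 3):
    rf = receptive_field(layers)
    print(f"{layers} stacked 3x3 layer(s): {rf}x{rf} receptive field")
```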